The Annals of Applied Statistics 

2011, Vol. 5, No. 2B, 1512-1533 

DOI: 10.1214/10-AOAS436 

© Institute of Mathematical Statistics, 2011 

POPULATION SIZE ESTIMATION BASED UPON RATIOS OF 
RECAPTURE PROBABILITIES 
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Estimating the size of an elusive target population is of promi- 
nent interest in many areas in the life and social sciences. Our aim 
is to provide an efficient and workable method to estimate the un- 
C ' known population size, given the frequency distribution of counts of 

Cn J repeated identifications of units of the population of interest. This 

counting variable is necessarily zero-truncated, since units that have 
^ , never been identified are not in the sample. We consider several ap- 

plications: clinical medicine, where interest is in estimating patients 
with adenomatous polyps which have been overlooked by the diag- 
nostic procedure; drug user studies, where interest is in estimating 
the number of hidden drug users which are not identified; veteri- 
^i ' nary surveillance of scrapie in the UK, where interest is in estimating 

the hidden amount of scrapie; and entomology and microbial ecology, 

where interest is in estimating the number of unobserved species of or- 

^- ' ganisms. In all these examples, simple models such as the homogenous 

C**) , Poisson are not appropriate since they do not account for present and 

Cn ■ latent heterogeneity. The Poisson-Gamma (negative binomial) model 

provides a flexible alternative and often leads to well-fitting models. 
It has a long history and was recently used in the development of 
J"""""- ' the Chao-Bunge estimator. Here we use a different property of the 

^^ | Poisson-Gamma model: if we consider ratios of neighboring Poisson- 

Gamma probabilities, then these are linearly related to the counts 
of repeated identifications. Also, ratios have the useful property that 
they are identical for truncated and untruncated distributions. In this 
paper we propose a weighted logarithmic regression model to estimate 
|r%J ' the zero frequency counts, assuming a Gamma-Poisson distribution 
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for the counts. A detailed explanation about the chosen weights and 
a goodness of fit index are presented, along with extensions to other 
distributions. To evaluate the proposed estimator, we applied it to the 
benchmark examples mentioned above, and we compared the results 
with those obtained through the Chao-Bunge and other estimators. 
The major benefits of the proposed estimator are that it is defined 
under mild conditions, whereas the Chao-Bunge estimator fails to 
be well defined in several of the examples presented; in cases where 
the Chao-Bunge estimator is defined, its behavior is comparable to 
the proposed estimator in terms of Bias and MSE as a simulation 
study shows. Furthermore, the proposed estimator is relatively in- 
sensitive to inclusion or exclusion of large outlying frequencies, while 
sensitivity to outliers is characteristic of most other methods. The 
implications and limitations of such methods are discussed. 

1. Introduction. The size N of an elusive population must often be 
determined. Elusive populations occur, for example, in public health and 
medicine, agriculture and veterinary science, software engineering, illegal be- 
havior research, in the ecological sciences and in many other fields [Bishop, 
Fienberg and Holland (1995), Bunge and Fitzpatrick (1993), Chao et al. 
(2001), Hay and Smit (2003), Pledger (2000, 2005), Roberts and Brewer 
(2006), Wilson and Collins (1992)]. A prominent problem in public health is 
the completeness of a disease registry [Van Hest et al. (2008)], while an inter- 
esting application of capture-recapture techniques in the veterinary sciences 
is the estimation of hidden scrapie in Great Britain [Bolming and Del Rio 
Vilas (2008)]. In software engineering [Wohlin, Runeson, and Brantestam 
(1995)] we are interested in finding the number of errors hidden in software 
components. In criminology the number of people with illegal behavior is 
of high interest [Van der Heijden, Cruyff, and Houwelingen (2003)], and in 
ecology we wish to estimate the number of rare species of organisms [Chao et 
al. (2001)]. All of these situations fall under the following setting. We assume 
that there are N units in the population, which is closed (no birth, death or 
migration), and that there is an endogenous mechanism such as a register, 
a diagnostic device, a set of reviewers, or a trapping system, which identifies 
n distinct units from the population. A given unit may be identified exactly 
once, or it may be observed twice, three times, or more. We denote the num- 
ber of units observed i times by fi , so that n = /i + ji + /3 H ; the number 

of unobserved or missing units is /o, so N = /o + n. The objective is to find 
an estimate (or rather a prediction) /o for /o, and hence an estimate N of N. 

To illustrate, we first introduce several examples from different domains; 
these are analyzed in the following sections: 

1. Methamphetamine use in Thailand. Surveillance data on drug abuse are 
available for 61 health treatment centers in the Bangkok metropolitan 
region from the Office of the Narcotics Control Board (ONCB). Using 
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Table 1 
Methamphetamine data — frequency distribution of treatment episodes per drug user 
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this data, it was possible to reconstruct the counts of treatment episodes 
for each patient in the last quarter of 2001. Table 1 presents the number of 
methamphetamine users for each count of treatment episodes [Bohning 
et al. (2004)]; the maximum observed frequency was 10. Here we are 
interested in estimating the number of hidden methamphetamine users. 
2. Screening for colorectal polyps. In 1990, the Arizona Cancer Center ini- 
tiated a multicenter trial to determine whether wheat bran fiber can 
prevent the recurrence of colorectal adenomatous polyps [Alberts et al. 
(2000), Hsu (2007)]. Subjects with previous history of colorectal ade- 
nomatous polyps were recruited and randomly assigned to one of two 
treatment groups, low fiber and high fiber. The researchers noted that 
adenomatous polyp data are often subject to unobservable measurement 
error due to misclassification at colonoscopy. It can be assumed that 
patients with a positive polyp count were diagnosed correctly, whereas 
it is unclear how many persons with zero-count of polyps were false- 
negatively diagnosed. Thus, we approach the data as if zero-counts were 
not observed, and we try to estimate the undercount from the nonzero 
frequencies. Table 2 shows the polyp frequency data for the two different 
treatment groups; the (overall) maximum frequency is 77. The number of 
subjects with an observed number of adenomas equal to is 285 for the 
Low Fiber treatment and 381 for High Fiber treatment respectively; we 
regard this as an undercount and seek to estimate the true unobserved 
frequencies /q. 



Table 2 
Polyps data — frequency distribution of recurrent adenomatous polyps per patient, by 

treatment group 
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Table 3 

Scrapie data — frequency distribution of the scrapie count 

within each holding for Great Britain in 2005 
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3. Scrapie in Great Britain. Sheep are kept in holdings in Great Britain 
and the occurrence of scrapie in the population of holdings is monitored 
by the Compulsory Scrapie Flocks Scheme [Bohning and Del Rio Vilas 
(2008)]. This was established in 2004 and summarizes three surveillance 
sources. Table 3 presents the frequency distribution of the scrapie count 
within each holding for the year 2005. Here interest is estimating /o, 
the frequency of holdings with unobserved or unreported scrapie. The 
maximum frequency in the data is 8. 

4. Malayan butterfly data. This data set derives from a large collection of 
Malayan butterflies collected by A. S. Corbet in 1942 [Fisher, Corbet 
and Williams (1943)]. There were 9031 individual butterflies classified to 
n = 620 species. Out of these 620 different species, 118 were observed 
exactly once, 74 twice, 44 three times and so forth. This "abundance" 
data is shown in Table 4. Fisher, Corbet and Williams (1943) reported 
exact counts only up to /24, stating that there were a total of 119 species 
with sample abundances (counts) greater than 24. Here the interest is in 
estimating the total number of species N. 

5. Microbial diversity in the Gotland Deep. The data on microbial diversity 
shown in Table 5 stem from a recent work by Stock et al. (2009). Mi- 
crobial ecologists are interested in estimating the number of species N 
in particular environments. Unlike butterflies, microbial species member- 
ship is not clear from visual inspection, so individuals are defined to be 
members of the same species (or more general taxonomic group) if their 
DNA sequences (derived from a certain gene) are identical up to some 
given percentage, 95% in this case. Here the study concerned protistan 
diversity in the Gotland Deep, a basin in the central Baltic Sea. The 
sample was collected in May 2005. The maximum observed frequency 
was 53. 

Table 4 
Butterfly data — frequency distribution of butterfly species collected in Malaya 
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Table 5 
Protistan diversity in the Gotland Deep — frequency counts of observed species 
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The classical approach to estimation of N is to assume that each popula- 
tion unit enters the sample independently with probability p (dealing with 
heterogeneous capture probabilities by modeling and averaging). Given p, 
the unbiased Horvitz-Thompson estimator of N is n/p, and the maximum 
likelihood estimator is its integer part [n/p\ . One then estimates p using any 
of several methods, and the final estimate of N is n/p or \n/p\ [Lindsay and 
Roeder (1987), Bohning et al. (2005), Bohning and van der Heijden (2009), 
Wilson and Collins (1992), Bunge and Barger (2008), Chao (1987, 1989), 
Zelterman (1988)]. 

Here we take a new approach: we consider ratios of successive frequency 
counts, namely, 

(x + l)f x+ i 
r(x):= . 

Jx 

Often r(x) appears as a roughly linear function of x, which leads us to apply 
linear regression to the scatterplot of (x,f(x)); we then project the regression 
function downward to the left, to zero, which yields ft and hence N. Figure 1 
shows the ratio plot of (x,f(x)) for the methamphetamine data; there is 
clear evidence for a linear trend. Projecting the line to the left, we obtain 
/o = 57,788 and, hence, N = 61,133. 

Figure 2 shows the ratio plot for the butterfly data; again there is a clear 
linear trend and here we also observe increasing variance in the points as x 
increases, which we will deal with via weighted least squares. In this case we 
find /o = 126 and N = 746. 

This simple and powerful method applies exactly when the frequency 
counts emanate from the Katz family of distributions, namely, the bino- 
mial, Poisson and gamma-mixed Poisson or negative binomial, and it ap- 
plies approximately to extensions of the Katz family and to general Poisson 
mixtures. It can be implemented using any statistical software package that 
performs weighted least squares regression, and it is superior to existing 
methods for the negative binomial model (including maximum likelihood) 
in several ways. In addition, it substantially mitigates the effect of truncating 
large counts (recaptures or replicates), which is an issue with almost every 
existing method, parametric or nonparametric. In Section 2 we discuss the 
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Fig. 1. Scatterplot with regression line of (x + l)f^ x+ i- ) /f x vs. x for the Bangkok metham- 
phetamine drug user data. 

method and its scope of applicability; in Section 3 we describe weighting 
schemes; in Section 4 we look at goodness of fit of the linear model; and 
in Section 5 we compare our method with existing techniques, analyze the 
five data sets, and discuss the implications of our findings. The Appendix 
covers aspects of the approximation used for reaching the linear model as 
well as a comparative simulation study, a discussion of standard error ap- 
proximations, and an assessment of the effect of deleting large "outlying" 
frequencies. 
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Fig. 2. Scatterplot with regression line of (x + l)f( x+ i)/f x vs. x for the butterfly data. 
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2. Linear regression and the Katz distributions. Let Po,pi,P2, ■ ■ ■ denote 
a probability distribution on the nonnegative integers. The condition 

(2.1) (X + lW = 7 + fa, x = 0,l,2,..., 

Px 

where 7 and 8 are real constants, characterizes the Katz family of dis- 
tributions [Johnson, Kemp and Kotz (2005)]. To yield a valid probability 
distribution, it is necessary that 7 > and 5 < 1. If 5 < 0, p x is the binomial 
distribution; if 5 = 0, p x is the Poisson; and if 5 € (0, 1), p x is the negative 
binomial. These distributions arise naturally as models for population size 
estimation. 

• Suppose that a given population unit may be observed on each of k "trap- 
ping occasions." Assume further that the trapping or capture probability, 
say, r, is the same on each occasion and that captures are independent 
across occasions, and also that the capture probability is the same (ho- 
mogeneous) for all units, and that units are captured independently of 
each other. If mi denotes the number of captures of the ith unit, then 
mi, . . . , rriM are i.i.d. binomial (k, r) random variables. This simple model 
is rarely realistic, but it can provide a lower bound for the population 
size, since the homogeneity assumption leads to downwardly biased esti- 
mation in the presence of heterogeneity. This is formally proved in Bohning 
and Schon (2005) for maximum likelihood estimation. In this case the fre- 
quency count data f\ , fi , . . . summarizes the nonzero values of m\ , . . . , m^ . 

• Now suppose that population unit i appears a random number of times 
rrii in the sample, but now mi, . . . , ttin are i.i.d. Poisson random variables 
with (homogeneous) mean A. This model arises naturally in species abun- 
dance sampling where each species contributes some number of represen- 
tatives to the sample; it also appears as an approximation to the binomial 
model with A ~ kr, for large k and small r. Again the homogeneity makes 
this model mainly useful for lower-bound benchmarking. 

• Assume now that the foregoing Poisson model holds, but with the mod- 
ification that the mean number of appearances of unit i is Aj, and that 
Ai, . . . , Atv are i.i.d. gamma-distributed random variables. Then the distri- 
bution of rrii is (unconditionally) gamma-mixed Poisson, that is, negative 
binomial. This is not the simplest possible model with heterogeneous cap- 
ture rates, but it may be the oldest, appearing in Fisher, Corbet and 
Williams (1943), the source of our butterfly data. (Note that it includes 
the geometric, since the exponential is a special case of the gamma.) The 
negative binomial distribution is widely applicable as a model for the fre- 
quency counts, when the data is not too highly skewed (left or right); 
however, it is surprisingly difficult to fit by, for example, maximum likeli- 
hood, or by other existing procedures such as the Chao-Bunge estimator 
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(see discussion below). We show below that, when implemented by our 
weighted least squares regression procedure, the negative binomial model 
becomes practical and useful for estimating N in a variety of situations. 

We make two further comments on distribution theory. First, it may be 
readily shown using the Cauchy-Schwarz inequality that the ratio on the 
left-hand side of (2.1) is nondecreasing for any mixed-Poisson distribution. 
This means that the linear relation, and hence our weighted linear regression 
procedure below, can be regarded as a first-order linear approximation for 
any Poisson mixture (not just gamma), thus justifying a degree of robustness 
of our method across a wide range of heterogeneity models. Second, there 
are extended versions of relation (2.1) which give rise to distributional ex- 
tensions of the Katz family that need not be mixed-Poisson [Johnson, Kemp 
and Kotz (2005)]. Such extensions may be parameterized and we conjecture 
that our method below will be robust to small perturbations along these 
parameters. 

Condition (2.1) suggests linear regression of the left-hand side upon the 
right, in some form. Observe that the natural estimate of p x would be 
p x (N) := fx/N, if N were known. But 

(x + l)p x+ i(N) _ (x + l)f x+1 /N _ (x + l)f x+1 _ 

p x (N) U/N f x nXh 

so we can fit a linear regression of f(x) on x without knowing N. We can 
then obtain an estimate of /o by setting x = so that r(0) = I/1//0 = 7, 
and, hence, /o = /1/7. In practice, however, we prefer to fit the response on 
a logarithmic scale, which is approximately linear near the origin and avoids 
negative fitted values. Thus, our basic equation becomes 

f(x + l)p x+ i\ 
log = 7 + 5x, 

V Px J 
and we fit the model 

(2.2) l0g ( (X+ I )/ " 1 ) =7 + fa + £ - 

We consider this in terms of linear regression in the next section. The esti- 
mate of /o is then /q = /ie~ 7 . 

In particular, consider the gamma-mixed Poisson or negative binomial 
model for the count data. Let the negative binomial be parameterized as 

where k > and p G (0, 1). Similar to other areas such as Poisson regression, 
we need to apply a suitable transformation to avoid negative values for the 
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ratios which would lead to negative estimates for /o . The log-transformation 
is appropriate, although others are also possible. Transforming both sides, 
we obtain 

log{(x + l)p(x + l)/p(x)} = log(x + k) +log(l -p), 

but now the right-hand side is nonlinear in k. However, taking the first-order 
Taylor expansion of log(k + x) around k, we achieve 

log(k + x) w log(fc) + -x, 

so that we have log(x + k) + log(l — p) ~ log(l —p) + \og(k) + x/k. Note that 
this approximation is exact for x = (the point where we predict) and good 
for x = 1 (corresponding to the informative "singleton" frequency count). In 
the Appendix we discuss this approximation further, as well as alternatives. 
With reference to model (2.2), we have 7 = log(l — p) + log(/c) and 5 = 1/k. 
We focus on this model in the discussion below. 

Note also that due to the simple structure of the estimator /o = /1 exp(— 7), 
we can use conditioning [Bohning (2008)] in combination with the <5-method 
to give an approximate expression for the variance of /o as 

Var(/ ) « exp(- 7 ) 2 /i[Var( 7 )/ 1 + 1], 

where Yav(j) is the variance of the intercept estimator in the regression 
model. An approximation to the variance of N = fo + n is then [using the 
same technique and estimating Var(ra) = iV(l — po)po by nfo/N] 

(2.3) Var(iV) « nt + exp(- 7 ) 2 /i[Var( 7 )/ 1 + 1]. 

Standard errors are obtained by plugging in estimates for Var(7) and tak- 
ing the (overall) square root. These expressions may be imprecise for small 
sample sizes (<100) and in such cases the bootstrap might be preferable. 
We provide a simulation study on this aspect in the Appendix. 

3. Heteroscedasticity and weighted least squares. Model (2.2) does not 
satisfy the classical linear regression assumptions. In the first place, the 
response is discrete (although log-transformed), so we might consider a gen- 
eralized linear model such as Poisson or even negative binomial regression. 
However, this is inadvisable since an appropriate formulation as a gener- 
alized linear model leads to an autoregressive equation involving \ogf x as 
an additional offset term in the linear predictor. These kinds of models ex- 
perience difficulties in terms of the definition of the likelihood as well as in 
carrying out inference. Actually, residuals derived from model (2.2) typically 
show reasonable conformity with normal probability plots when the linear 
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model fits well (see Section 4 regarding goodness of fit). The issues of de- 
pendence and heteroscedasticity are more important, and we address these 
by using weighted least squares. We take 



J = (X r WX)~ 1 X T WY, 



where 



( lo §( h ) \ 
log(f) 



\^(&J 



X: 



f 1 
1 



m 



1/ 



and m is the maximum frequency used in the estimator (see Section 4 below 
regarding truncation of large frequencies). To reduce MSE, we wish to take 
W ~ (cov(Y))" 1 . To find cov(Y), assume that the distribution of the cell 
counts fi, . . . , f m is multinomial with cell probabilities tt = (tti, . . . , 7r m ) T . 
Then it is well known that f = (/i,...,/ m ) has covariance matrix X = 
n[A(-7r) — 7T7T ], where A(-7r) is a diagonal matrix with elements tt on the 
diagonal, and n = f\ -\ \- f m . Writing 



S = n[A(7r)-7rvr T ] = 
we see that E can be estimated as 

E = A(f) 



A(n7r) 



-nirnir 
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-f f 1 
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An application of the multivariate delta-method then shows that an estimate 
of cov(Y) is 



(3.1) 



V f (Y(f))£(Vf(Y(f))) 
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Note that this requires that only nonzero frequencies be used in the esti- 
mate. 
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The tridiagonal matrix (3.1) has a special structure, and Meurant (1992) 
gives an analytical formula for its inverse. In addition, a calculation based 
on the representation in Meurant's Theorem 2.3 indicates that it may be 
possible to drop the off-diagonal terms in cov(Y) with little loss of numerical 
precision for our purposes. This corresponds to our intuition that covariances 
between adjacent log-ratios may not play a large role in reducing MSE. Let 
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be the diagonal part of (3.1); we then suggest using (3.2) in our weighted 
regression model. This is computationally simpler, especially when dealing 
with a high number of recaptures. A small simulation study confirms the 
precision of this simplification, at least within the domain of the simulation. 
We computed the bias of N using the weighted regression model under three 
scenarios: with weights according to (3.1), according to (3.2) and according 
to W = I m -i [the (m — l)-dimensional identity matrix, i.e., unweighted]. 
Frequency data were drawn from a negative binomial distribution with pa- 
rameters p = 0.8 and k = 7, and replicated 1000 times. Table 6 shows results 
for N = 100 and N = 1000. It is clear that weighting is important in fitting 
the model: the unweighted regression model leads to potentially heavily bi- 
ased estimators of the population size, whereas the effect of ignoring the 
covariance between log(xf x /f x -i) an d log((x + l)f x+ i/f x ) is negligible. Fi- 
nally, we note that weighted least squares can introduce numerical problems, 



Table 6 

The effect of different weight matrices according to (3. 1), 

(3.2) and W = I m -i for frequency data from, the Negative 

Binomial distribution with parameters k — 7 , p — 0.8 
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especially in sparse-data situations [Bjorck (1996), Chapters 4 and 6]; how- 
ever, our design matrix has only rank 2 and our maximum frequency m is 
typically not too large, so we have not yet encountered such problems here. 
This is a topic for future research in this context. 

4. Model assessment and goodness of fit. The ratio plot shown in Fig- 
ure 1 is our main graphical tool for looking at goodness of fit of the linear 
regression model, and having fit the model, the standard diagnostic plots 
of residuals are also available. We also require a quantitative assessment of 
overall fit: R 2 could be used based on the response log(x + l)f x+ i/ f x , but in 
this setting it seems more appropriate to work on the original frequency of 
counts scale. In addition, we are looking for a measure which allows analysis 
of residuals. We therefore compare the observed frequencies with the esti- 
mated frequencies from the model, using the x 2 -statistic as a goodness-of-fit 
measure [Agresti (2002)]. The estimated frequencies based on the regression 
model are 

1 (X + I)f x + 1 „ 4 

Vx = log - = 7 + ox, 



1, 2, . . . , m, or, equivalently, 

(x + l)/x+l 



exp(y x ), 



fx 

where m is the "truncation point" or maximum frequency used in the anal- 
ysis (we return to this issue below). In general, the estimated ratios of fre- 
quencies fx+i/fx need not uniquely determine f x+ \ and f x , but in this case 

they do since f = fi/exptf) = /i/(/i//o)- This also shows that fi = fi, 
since f = /i/exp(7) = fi/exp(y ), andjience, /i = / exp(y ) = /l- Now, 
with /i given the equation 2/2//1 = 2/2//1 determines J2 uniquely, leading 
to the recursive relation f x+ \ = f x exp(y x ) / (x + 1), x = 1, 2, . . . ,m — 1. We 
then define our x 2 statistic as 



X 



2 _ V"^ \Jx Jx) 



E 



1 JX 



and simulations support that this has a \ 2 distribution with m — 2 degrees 
of freedom if the regression model y x = 7 + 5x is correct. Note that we 
have m unconstrained frequencies, since n = Y^=i fx is random, and we lose 
2 degrees of freedom due to estimating the intercept and slope parameters. 
Note also that the estimate of the intercept parameter fixes f\ = /1 , so that 
the degrees of freedom are indeed only reduced by 2. This approach has the 
benefit of gaining one degree of freedom when compared to a goodness-of-fit 
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measure based solely on the regression model which works with the m — 1 
values y x , x = 1, . . . , m — 1. 

This argument is conditional upon fixing the value of m, and, indeed, all 
known procedures for population size estimation truncate large "outlier" fre- 
quencies in some way. To illustrate, we return to the classical maximum like- 
lihood (ML) approach. Bunge and Barger (2008) describe a procedure which 
fits the desired distribution (here, the negative binomial) to the (nonzero) 
frequency count data by ML; the estimate of iV is then based upon the 
estimated parameter values of the distribution. Typically, parametric distri- 
butions can only be made to fit the data up to some truncation point m, 
beyond which the fit, as assessed by the classical Pearson x 2 test, falls off 
considerably; consequently, only frequencies up to m are used to obtain the 
estimate of N, and the number of units with frequencies greater than m 
is added to the estimate ex post facto. Bunge and Barger (2008) propose 
a goodness-of-fit criterion for selecting m, while the coverage-based non- 
parametric methods of Chao and co-authors fix m heuristically at 10 [see 
Chao and Bunge (2002)]. Our weighted linear regression approach also has 
the potential for loss of fit as m increases, depending on the realized struc- 
ture of the data, and again we can fix m prior to the analysis, and collapse 
all frequencies greater than this threshold to one value. Sensitivity of the 
various methods to the choice of m is a complex topic [Bunge and Barger 
(2008) compute all estimates at all possible values of m]; however, our data 
analyses below show that the weighted linear regression model is consider- 
ably less sensitive to m than its chief competitors in the negative binomial 
case, namely, ML and the Chao-Bunge estimator. 

Finally, we note that in the ML approach, if the negative binomial fit 
is less than ideal (although perhaps still acceptable), numerical maximum 
likelihood algorithms often do not converge, or converge to the edges of the 
parameter space, which in turn distorts the apparent fit. The regression- 
based method described here offers a more robust approach to parameter 
estimation, and appears not to be prone to the numerical problems which 
arise for maximum likelihood estimation under the negative binomial model. 
In fact, the negative binomial parameter estimates (p, k) derived from the 
regression model and could be used as starting values for a numerical search 
for the ML estimates. This is a topic for further research. 

5. Alternative estimators, data analyses and discussion. 

5.1. Alternative estimators. We first consider certain other options for 
the negative binomial model. 

• Maximum likelihood. This approach is well studied and has a long history 
[see Bunge and Barger (2008)], but as noted above, good numerical solu- 
tions for the model parameters (p, k) seem to be remarkably difficult to 
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obtain, even using reasonably sophisticated search algorithms with high- 
precision settings. In our experience we get good numerical convergence 
only when the frequency data is smooth and fits the negative binomial 
well, or the right-hand tail is fairly severely truncated. The latter issue 
causes the additional computational burden of investigating a diversity of 
truncation points, each involving numerical optimization. Nonetheless, we 
can obtain ML results for the negative binomial in some cases. The ML 
estimator Nml is consistent for N given that the model is correct. 

• Chao-Bunge. Let r denote the probability of observing a unit at least 
twice, that is, r = 1 — po — p\. Chao and Bunge (2002) developed a non- 
parametric estimator f for r, and on this basis proposed the estimator 

m j, 

for N. They showed that Nqb is consistent for N under the negative bino- 
mial model. However, in applied data analysis f may be very small or even 
negative, leading to very large or negative values of Nqb- This is one rea- 
son that Chao and Bunge set m = 10 (as noted above). In fact, iVcB fails 
roughly as often as Nml, although not necessarily in the same situations. 

• Chao. Chao (1987, 1989) proposed the nonparametric statistic 

which is valid as a (nonparametric) lower bound for N; we compute it 
here as a benchmark. Note that m = 2. 

We are currently investigating the asymptotic behavior of our estimator N 
in detail. Here we can make the following observations. First, assume that 
the upper frequency cutoff m is selected as m = max{j : fi> 0,i = 1, . . . ,j}, 
so that m is a random variable. For the unweighted case, that is, W = I m -\ 
in Section 3 above, it may be readily shown that N/N — > 1 in probability 
as N — > oo, when either Y = \{i + l)/j + i//j] and (i + l)pi + \/pi = 7 + Si (the 
Katz condition), or Y = [log((i + l)/i+i//i)J and log((i + l)p i+ i / pi) =^ + 5i. 
If W = (cov(Y))" 1 or a diagonal matrix with positive variances as entries 
(similar to those discussed in Section 3), then we conjecture that analogous 
results can be obtained (here W must be a function of m). The convergence 
question is more complex for a weight matrix W that is estimated and 
perhaps approximated further (as in Section 3), although we believe that 
a Slutsky-type argument will again yield the desired consistency result. In 
any case, we note again that from our practical experience a weighted esti- 
mator [even with estimated weights using (3.2)] increases the efficiency and 
reduces the bias of the estimator considerably compared to the unweighted 
one (cf. Table 6). 
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5.2. Data analyses. We applied the proposed regression method and the 
alternative procedures to the five data sets discussed above. The results are 
shown in Table 7. Here the cutoff m was selected for the weighted linear 
regression model by taking the first m at which f m > and f m +i = 0; for 
the ML procedure m was selected by a goodness-of-fit criterion described in 
Bunge and Barger (2008), and m = 10 for Nqb and N~ch- 

We observe first that N gives an answer in every case, unlike Nml and Ncb- 
For the methamphetamine data, although the x 2 p- value is low, the result 
appears reasonable, especially with reference to the Chao lower bound. For 
the polyps — low data, TV gives the most precise result, with good fit; for 
the polyps — high data, the same is true but with less good fit. Despite the 
goodness-of-fit test in the latter case, though, residuals plots for both polyps 
data sets indicate reasonable conformity with the linear model, as shown in 
Figure 3. For the scrapie data it is interesting to note that N gives a rea- 
sonable result with good fit while both Nml and Nqb fail. For the butterfly 
data, N is comparable to Ncb, with good fit of the linear model, while the 
ML result is only slightly above the lower bound, with poor fit, indicat- 
ing difficulty with the ML numerical search. Finally, for the microbial data, 
both A/ml and ./Vcb fail, while N < N<jh with poor fit, signaling that the 
data set is anomalous in some way (in fact, it is highly skewed left). Over- 
all, the weighted linear regression approach shows up well in contrast to its 
competitors for the negative binomial model. 

5.3. Discussion. The main challenge in population-size estimation is ar- 
guably heterogeneity, that is, the fact that in real applications the capture 
probabilities or sampling intensities of the population units are not all equal. 
The statistician must account for this in some way or risk the severe down- 
ward bias of procedures based on the assumption of homogeneity, that is, 

Table 7 

Data analyses. N = weighted linear regression model; Nml — negative binomial maximum 

likelihood estimate; Ncb = Chao-Bunge estimator; Ncn = Chao lower bound; 

SE = standard error; p = p-value from \ goodness-of-fit test; * = estimation failed 



Study 


N 


SE 


P 


/Vml 


SE 


V 


Ncb 


SE 


iVch 


Meth. 


61,133 


17,088.8 


0.000 


* 


* 


* 


* 


* 


33,090 


Polyps — 


495 


37.15 


0.340 


892 


342.3 


0.619 


668 


141.4 


458 


low 




















Polyps — 


513 


52.0 


0.001 


587 


77.2 


0.010 


584 


72.0 


511 


high 




















Scrapie 


459 


112.0 


0.298 


* 


* 


* 


* 


* 


353 


Butterflies 


746 


24.6 


0.200 


715 


19.9 


0.000 


757 


32.4 


714 


Microbial 


183 


35.9 


0.000 


* 


* 


* 


* 


* 


212 
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Oil "pure" binomial or Poisson models. Since the time of Fisher, Corbet and 
Williams (1943), considerable success has been achieved using mixed-Poisson 
models with various mixture distributions intended to model heterogeneity, 
including the gamma, lognormal, inverse Gaussian, Pareto, generalized in- 
verse Gaussian and, more recently, finite mixtures of point masses or of 
exponentials [Bunge and Barger (2008), Quince, Curtis and Sloan (2008), 
Bohning and Schon (2005)]. But the substantive applications, such as those 
described in our examples here, typically do not offer a theoretical basis for 
selection of a mixing distribution, so researchers have had to search ever 
further afield for flexible and adaptable heterogeneity models. This is partly 
due to a perception that the "classical" gamma-mixture or negative binomial 
model is too restrictive and difficult to fit, both statistically and numerically. 
However, existing mixed-Poisson-based procedures, whether frequentist 
or Bayesian, are almost all based on the likelihood of the frequency count 
data. Here we take a completely different perspective based on the Katz 
relationship (2.1), finding that in many cases the ratio of successive frequency 
counts f(x) = (x + l)f x +i/fx appears as an approximately linear function 
of x. This relationship holds exactly for the gamma-mixture or negative 
binomial, and provides an improved method both for fitting that model and 
for assessing its fit. Furthermore, from the data-analysis perspective, the 
linear relationship seems to hold across a wide variety of data sets; and 
from the theoretical perspective, we know that every mixed-Poisson has 
(at least) monotone increasing Katz ratios, and that the Katz distribution 
family itself admits extensions in several directions. We therefore believe that 
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this perspective — looking at the data via f{x) — opens up a new method of 
applying the negative binomial model to data, and that it gives us a view of 
a new and little-known territory for exploring the robustness and extensions 
of that model. 

APPENDIX: SIMULATION STUDY, STANDARD ERRORS AND 
DEPENDENCE ON THE TRUNCATION POINT 

A.l. Comparative simulation study. We begin with one further exten- 
sion. The suggested weighted linear regression estimator A depends on 
a first-order Taylor approximation which might not be good for larger values 
of x. One might consider a second-order approximation, but this leads to an 
estimator with large variance due to the functional relationship of x and x 2 . 
An alternative linear approximation is possible by developing log (A; + x) = 
log ((A; — 1) + (x + 1)) linearly around x + 1, leading to the approximation 

log(x+l) + (fc-l)/(x + l) 

and the regression model 

(A.i) log ( {x + y/ x+1 ) ~ lQ g( x + 1) = Y + *7(s + 1) + £ *- 

We call this the hyperbolic model (HM). The hyperbolic model is also of 
very simple structure and prediction is possible since the model is defined 
for x = leading to /o = f\/ exp(-y' + a). We denote the estimator based on 
this model by Ahm- 

In the following simulation comparison, then, we compare N, AhM) Acb 
and Nch- We generated counts from a negative binomial distribution with 
dispersion parameters equal to 1, 2, 4, 6 and 10 and event probability pa- 
rameter such that the associated mean matches 1. The population sizes to 
be estimated were A = 100 and A = 1000. For N = 1000 a case with a com- 
bination of /i = 0.5, k = 0.5 was included which we have observed as typical 
values in our data sets (Butterfly and Polyps data). A sample X\, . . . , Ajv of 
size A was generated from a negative binomial distribution with parameters 
as described above and the associated frequency distribution /o, /i, • • • , /m 
was determined; then /o was ignored and fi,...,f m were used to compute 
the various estimators. This process was repeated 1000 times and bias, vari- 
ance and MSE were calculated from the resulting values. The results are 
shown in Table 8. Clearly, A performs better than Ahm since the former 
always has smaller MSE than the latter. In fact, there are only three cases in 
which Ahm had smaller bias than A, namely, N = 1000 and k = 1, 2 as well 
as the combination ji = 0.5, k = 0.5, and the smaller bias here was balanced 
by the smaller variance of N. Hence, we do not consider Ahm any further. 
We see in addition that N and Acb overestimate the true size A = 100, 
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Table 8 

RMSE and Bias for estimators based upon the WLRM, the HM, the Chao-Bunge 

estimator and the lower bound estimator of Chao, N — 100 and N = 1000, 

k — 1,2, 4, 6, 10, where k is the dispersion parameter of the negative-binomial with mean 

(i = l. Chao-Bunge estimates have been computed only for positive values 



k 


WLRM 


HM 


Chao-Bunge 


Chao 


RMSE N = 100 










1 


25.36 


366.89 


1475.91 


27.60 


2 


31.93 


816.54 


1145.43 


21.14 


4 


37.93 


557.87 


585.20 


18.59 


6 


43.56 


800.57 


642.57 


18.21 


10 


54.72 


3453.55 


256.71 


18.47 


BIAS JV = 100 










1 


-10.03 


115.98 


81.08 


-21.33 


2 


4.39 


124.90 


52.11 


-11.49 


4 


12.22 


113.29 


31.37 


-4.89 


6 


15.23 


116.89 


30.60 


-2.07 


10 


16.93 


162.21 


17.01 


-0.30 


RMSE N = 1000 










1 


185.62 


247.96 


191.25 


251.28 


2 


87.11 


206.02 


117.80 


152.88 


4 


72.79 


176.69 


96.55 


93.04 


6 


75.81 


165.98 


86.61 


73.10 


10 


79.26 


161.73 


81.08 


59.70 


[i = 0.5, k = 0.5 


375.72 


576.80 


5247.90 


471.19 


BIAS N = 1000 










1 


-177.89 


92.68 


23.70 


-247.25 


2 


-59.9 


49.46 


12.88 


-145.51 


4 


-1.88 


-12.05 


9.96 


-78.53 


6 


13.26 


-42.45 


7.96 


-52.99 


10 


21.88 


-72.31 


7.28 


-31.75 


^ = 0.5,fc = 0.5 


-368.16 


192.00 


-145.47 


-468.33 



whereas Nch tends to underestimate. We need to point out that Nqb pro- 
duced many negative values, so its bias and RMSE were evaluated on the 
basis of the positive values. The bias of N is smaller than that of -/Vcb f° r 
N = 100, although this reverses for N = 1000, and the bias is of the same 
size as that of Nch f° r N = 100 and becoming smaller for N = 1000. Also, 
the RMSE of ./Vcb is a lot larger than that of N. The situation changes for 
N = 1000. In this case both the bias and MSE for N are lower than those 
from iVch for every value k of the dispersion parameter. We notice, however, 
that ./Vcb shows a reduced bias, but the RMSE of N is still smaller. Overall, 
we find that iV and iVcB are behaving somewhat similarly for larger popu- 
lation sizes; however, a major benefit of N is that it is well defined in the 
many situations where iVcB fails. 
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A. 2. Standard errors. In Table 9 we compare the standard error cal- 
culated from (2.3) with the true standard error. This was done by tak- 
ing 10,000 replications of N, say, Ni,i = 1, ..., 10,000. Then the mean of 
(1/10,000) SjVar(iVj) was computed and the root of it forms column 2 in 
Table 9. The third column was constructed by simply computing the em- 
pirical variance of Ni,i = 1, ..., 10,000. We see that the approximation is 
good (and always conservative) for larger values of N and reasonable for 
smaller values of N. Finally, we would like to mention the bootstrap as 
an alternative to the approximate standard errors given above. The boot- 
strap is straightforward to implement here: first obtain N from the original 
data; then resample (simulate) /q , /* , . . . based on the fitted po,pi,. . .; then 
delete /q and calculate a new N* from the new sample. Replicate this pro- 
cedure B times (say) and from the resulting iV*'s calculate a standard error 
for N, percentile-based confidence intervals, and so forth. 

A.3. Dependence of estimators on the truncation point. Table 10 shows 
the dependence of N vs. that of Nqb on the truncation point for the first 
four data sets considered here. The behavior of N is notably more stable 
than iVcB m this regard, except perhaps for the butterfly data. The nega- 
tive binomial MLE and the coverage-based nonparametric estimators also 
display considerable instability with respect to m, except in the case of the 
butterfly data (results not shown). The only other procedure we know of 
that is relatively robust with respect to m is the parametric estimator based 

Table 9 
Estimated [using (2. 3)] and true standard error for 

WLRM estimator N; N = 100 and N = 1000, 

k— 1,2,4,6,10, fj,— 1; results are based on 10,000 

replications 



k 




S.E.(N) 


True S.E.(N) 


AT = 


:100 






1 




26.94 


23.06 


2 




36.36 


30.00 


4 




44.23 


38.02 


6 




44.13 


38.57 


10 




41.88 


42.21 


AT = 


:1000 






1 




52.31 


52.67 


2 




64.73 


64.36 


4 




72.61 


71.64 


6 




75.68 


73.51 


10 




77.90 


76.12 
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Table 10 

Dependence of the weighted least-squares N and the Chao-Bunge estimator on the 

truncation point, compared for all data sets 





Polyps 


-low 


Polyps 


-hi 


Butterflies 


Microbial 


m 


WLRM 


C-B 


WLRM 


C-B 


WLRM 


C-B 


WLRM 


C-B 


3 


609 


411 


881 


446 


754 


682 


767 


266 


4 


525 


440 


620 


459 


744 


696 


364 


492 


5 


509 


471 


542 


472 


776 


715 


364 


492 


6 


523 


524 


513 


482 


759 


727 


364 


-240 


7 


519 


596 


512 


497 


752 


737 


364 


-240 


8 


503 


643 


519 


532 


746 


746 


364 


-75 


9 


495 


668 


510 


570 


741 


752 


216 


-59 


10 


495 


668 


510 


570 


732 


757 


212 


-49 


11 


495 


844 


510 


586 


726 


761 


214 


-42 


12 


495 


844 


506 


607 


724 


765 


205 


-43 


13 


495 


844 


506 


607 


717 


768 


197 


-45 


14 


495 


844 


506 


607 


718 


774 


195 


-46 


15 


495 


844 


506 


607 


712 


777 


195 


-46 


16 


495 


844 


506 


607 


711 


783 


195 


-46 


17 


495 


844 


506 


607 


708 


788 


195 


-46 


18 


495 


844 


506 


607 


704 


792 


182 


-48 


19 


495 


844 


506 


607 


704 


797 


182 


-48 


20 


495 


844 


506 


607 


701 


802 


182 


-48 


21 


495 


844 


506 


607 


698 


805 


182 


-48 


22 


495 


1821 


506 


607 


695 


807 


182 


-48 


23 


495 


1821 


506 


607 


693 


808 


182 


-48 


24 


495 


1821 


506 


607 


692 


810 


182 


-48 


28 


495 


-2250 


506 


607 






182 


-48 


29 






506 


607 






182 


-43 


31 






506 


1063 






182 


-43 


42 






506 


1063 






182 


-33 


53 






506 


1063 






182 


-27 


77 






506 


-301 











on finite mixtures of geometries (i.e., Poisson where the Poisson mean is 
distributed as a finite mixture of exponentials); for details on this model see 
Bunge and Barger (2008). 
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